Bayesian penalized likelihood PET reconstruction impact on quantitative metrics in diffuse large B-cell lymphoma

Evaluate the quantitative, subjective (Deauville score [DS]), and reader agreement differences between standard ordered subset expectation maximization (OSEM) and Bayesian penalized likelihood (BPL) positron emission tomography (PET) reconstruction methods. This was a retrospective review of 104 F-18 fluorodeoxyglucose PET/computed tomography (CT) exams among 52 patients with diffuse large B-cell lymphoma. An unblinded radiologist moderator reviewed and annotated both the BPL and OSEM PET/CT exams. Four blinded radiologists then reviewed the annotated cases to provide a visual DS for each annotated lesion. BPL and OSEM differed significantly (P < .001), with greater standardized uptake value (SUV) maximum and SUV mean for BPL. The DS was altered in 25% of cases when BPL and OSEM were reviewed by the same radiologist. Interobserver DS agreement was higher for OSEM (>1 cm lesion = 0.89 and ≤1 cm lesion = 0.84) than for BPL (>1 cm lesion = 0.85 and ≤1 cm lesion = 0.81). Among the 4 readers, the average intraobserver visual DS agreement between OSEM and BPL was 0.67 for lesions >1 cm and 0.40 for lesions ≤1 cm. F-18 fluorodeoxyglucose PET/CT of diffuse large B-cell lymphoma reconstructed with BPL has higher SUV values, altered DSs, and different reader agreement compared with OSEM. This report finds volumetric PET measurements such as metabolic tumor volume to be similar between BPL and OSEM PET reconstructions. Efforts such as adoption of European Association of Nuclear Medicine Research Ltd (EARL) accreditation should be made to harmonize PET data, with the aim of balancing the need for harmonization against sensitivity for lesion detection.


Introduction
Positron emission tomography (PET) image reconstruction has evolved from filtered backprojection to iterative methods such as ordered subset expectation maximization (OSEM), and more recently to methods incorporating point spread function (PSF) modeling and Bayesian penalized likelihood (BPL) reconstruction algorithms such as Q.Clear (GE Healthcare, Waukesha, WI). [1] Q.Clear incorporates PSF modeling and adds a penalty based on differences between neighboring voxels, which controls the noise that would otherwise increase with successive reconstruction iterations. The penalty factor (β) is adjustable and allows users to tune the level of image noise.
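The penalized objective behind BPL can be illustrated with a small sketch. It assumes a generic Poisson log-likelihood plus a relative difference penalty, the penalty form commonly described for Q.Clear. The function names, the neighborhood (adjacent pixels of a 1D image), and γ = 2 are illustrative assumptions rather than GE's implementation, and β here is on an arbitrary scale that is not comparable to the Q.Clear β of 300 used in this study.

```python
import math


def poisson_log_likelihood(y, ybar):
    """Poisson log-likelihood without the constant log(y!) term.

    y: measured counts per detector bin; ybar: expected counts (must be > 0).
    """
    return sum(yi * math.log(bi) - bi for yi, bi in zip(y, ybar))


def relative_difference_penalty(x, gamma=2.0):
    """Relative difference penalty over neighboring pixel pairs of a 1D image.

    Each pair (x_j, x_k) contributes (x_j - x_k)^2 / (x_j + x_k + gamma * |x_j - x_k|),
    so a flat image has zero penalty and noisy images are penalized more.
    """
    total = 0.0
    for xj, xk in zip(x[:-1], x[1:]):
        denom = xj + xk + gamma * abs(xj - xk)
        if denom > 0:
            total += (xj - xk) ** 2 / denom
    return total


def penalized_objective(y, x, system_matrix, beta, gamma=2.0):
    """Objective a BPL-type algorithm maximizes: L(x) - beta * R(x).

    This sketch only evaluates the objective; it does not perform the maximization.
    """
    # Forward projection: expected counts in each detector bin for image x.
    ybar = [sum(a * xj for a, xj in zip(row, x)) for row in system_matrix]
    return poisson_log_likelihood(y, ybar) - beta * relative_difference_penalty(x, gamma)
```

A larger β lowers the objective more for noisy candidate images, which pushes the maximizing image toward smoother solutions; this is the noise-control trade-off that the adjustable β exposes.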
In phantom studies, PSF and BPL reconstructions have shown higher spatial resolution than OSEM for objects smaller than 1 cm. [2][3][4][5][6][7][8] Some data suggest that BPL has higher spatial resolution than PSF. [8] Clinically, BPL reconstruction reduces noise, especially in larger patients and within smaller lesions. [9][10][11][12][13][14][15] The body weight (BW)-normalized maximum standardized uptake value (SUVmax) is a semi-quantitative metric of radiotracer activity within a region of interest (ROI) that accounts for the injected dose (ID) and BW while correcting for radioactive decay: $\mathrm{SUV} = C_{\mathrm{ROI}} / (\mathrm{ID}/\mathrm{BW})$, where $C_{\mathrm{ROI}}$ is the activity concentration in the ROI (for SUVmax, in its hottest voxel) and ID is decay corrected to the time of imaging. Typically, SUVmax values of PSF and BPL are higher than those of OSEM (especially in smaller lesions), and this difference tends to be more pronounced in tumors than in normal liver. [14][15][16][17][18][19][20][21] Alterations in relative SUVmax values, and in the relationship between lesion and liver SUV, raise concerns about the accuracy and consistency of interpretation, including when imaging lymphoma. [22] Reader agreement using OSEM-reconstructed PET data is generally moderate to high. [23][24][25][26][27] Prior studies have shown moderate interobserver agreement for visual Deauville scores (DS) when imaging diffuse large B-cell lymphoma (DLBCL) with OSEM-reconstructed F-18 fluorodeoxyglucose (FDG) PET exams. [28][29][30][31] However, there are sparse data on intra- and interobserver agreement between different PET reconstruction methods when imaging DLBCL. Here we assess the difference between OSEM and BPL (Q.Clear) in PET/computed tomography (CT) quantitative metrics and DS reader agreement in a group of patients with DLBCL.
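The SUV definition above can be written out as a short calculation. This sketch assumes activity concentration in kBq/mL, injected dose in MBq, body weight in kg, a tissue density of 1 g/mL, and decay of the injected dose to the scan time using the F-18 half-life (about 109.8 minutes). The function names are hypothetical, and scanner software may apply decay correction in a different but equivalent way.

```python
import math

# Approximate physical half-life of F-18 in minutes.
F18_HALF_LIFE_MIN = 109.77


def decay_corrected_dose_kbq(injected_mbq, minutes_elapsed, half_life_min=F18_HALF_LIFE_MIN):
    """Activity remaining at scan time, in kBq, after physical decay since injection."""
    remaining_mbq = injected_mbq * math.exp(-math.log(2) * minutes_elapsed / half_life_min)
    return remaining_mbq * 1000.0


def suv_bw(concentration_kbq_per_ml, injected_mbq, body_weight_kg, minutes_elapsed):
    """Body-weight SUV = C_ROI / (ID / BW), assuming 1 mL of tissue weighs 1 g."""
    dose_kbq = decay_corrected_dose_kbq(injected_mbq, minutes_elapsed)
    weight_g = body_weight_kg * 1000.0
    return concentration_kbq_per_ml / (dose_kbq / weight_g)


def suv_max_and_mean(voxel_concentrations, injected_mbq, body_weight_kg, minutes_elapsed):
    """SUVmax is the highest voxel SUV in the ROI; SUVmean is the average over the ROI."""
    suvs = [suv_bw(c, injected_mbq, body_weight_kg, minutes_elapsed) for c in voxel_concentrations]
    return max(suvs), sum(suvs) / len(suvs)


# Hypothetical example: 370 MBq injected, 70 kg patient, imaged 60 minutes later.
suv_max, suv_mean = suv_max_and_mean([12.0, 9.5, 7.0], 370.0, 70.0, 60.0)
print(f"SUVmax = {suv_max:.2f}, SUVmean = {suv_mean:.2f}")
```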

Patients
After approval by the Mayo Clinic Institutional Review Board, a single-institution retrospective review was initiated by querying the patient database of a tertiary referral center for consecutive adult patients with exam indications containing the terms "DLBCL" and "B cell" between January 1, 2016 and August 8, 2018.

PET/CT technique
Patients were instructed to fast for at least 4 hours prior to the exam. Exams were performed on standard PET/CT equipment (GE Discovery 710 and GE Discovery MI) using a clinical oncology technique (PET: 3- to 5-minute acquisition per bed position based on patient body mass index, 192 × 192 matrix, 70 cm field of view; CT: non-enhanced, dose-modulated ~90 mAs, 120 kVp, pitch 1.0, 3.75-5.0 mm slice thickness). The PET exams were reconstructed using both OSEM and BPL (Q.Clear) techniques. BPL reconstruction incorporated time-of-flight and PSF information. The β value used for BPL reconstruction was 300.

PET/CT interpretation
Five board-certified Nuclear Radiologists with 3 (JRY), 3 (ECE), 3 (ATP), 9 (ACH), and 24 (MAN) years of clinical experience reviewed the cases using the same standard clinical interpretation hardware and software. One Radiologist moderator (JRY) reviewed the cases and saved sessions with annotations highlighting the most FDG-avid lesion >1 cm and, when present, the most FDG-avid lesion ≤1 cm for both the OSEM and BPL reconstructions. The remaining 4 Radiologists served as blinded reviewers. The moderator used MIM Software Inc. (Cleveland, OH) to segment the lesions with an automated method (PET Edge), and BW SUVmax, SUVmean, metabolic tumor volume (MTV), and total lesion glycolysis (TLG) were measured for both OSEM and BPL reconstructions. Lesion size was measured manually on the CT images of the PET/CT exam. Physiologic uptake in the liver and thoracic aorta blood pool was measured using 3 cm and 1.5 cm spherical regions of interest, respectively. The signal-to-noise ratio of each reconstruction was calculated as the liver SUVmean divided by the standard deviation of the SUV within the liver region of interest.
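The lesion and liver measurements described above reduce to a few formulas: MTV is the segmented voxel count times the voxel volume, TLG is SUVmean times MTV, and the liver signal-to-noise ratio is the liver SUVmean divided by the SUV standard deviation. PET Edge is proprietary software whose algorithm is not described here, so this sketch swaps in a simple fixed-threshold segmentation (voxels at or above 41% of SUVmax, an assumed cut-off) purely for illustration; it will not reproduce PET Edge contours, and the function names are hypothetical.

```python
import statistics


def threshold_segment(suv_voxels, fraction=0.41):
    """Keep voxels with SUV >= fraction * SUVmax (fixed-threshold stand-in for PET Edge)."""
    cutoff = fraction * max(suv_voxels)
    return [v for v in suv_voxels if v >= cutoff]


def lesion_metrics(suv_voxels, voxel_volume_ml, fraction=0.41):
    """SUVmax, SUVmean, MTV (mL) and TLG for one lesion's candidate voxels."""
    segmented = threshold_segment(suv_voxels, fraction)
    suv_mean = statistics.mean(segmented)
    mtv_ml = len(segmented) * voxel_volume_ml
    return {
        "SUVmax": max(segmented),
        "SUVmean": suv_mean,
        "MTV_mL": mtv_ml,
        "TLG": suv_mean * mtv_ml,  # TLG = SUVmean x MTV
    }


def liver_snr(liver_suvs):
    """Signal-to-noise ratio: mean liver SUV divided by its sample standard deviation."""
    return statistics.mean(liver_suvs) / statistics.stdev(liver_suvs)
```

The voxel volume depends on the reconstruction matrix and slice spacing, and the use of the sample (rather than population) standard deviation for the liver SNR is an assumption.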
The OSEM and BPL sessions were randomized and distributed to the reviewers in batches of 10 sessions, with at least 2 weeks between batches. Prior to interpretation, all Radiologists attended a training session at which it was agreed to assign DSs visually using the rotating maximum intensity projection images. DSs were assigned using the Lugano classification, where a score of 1 (DS1) is no abnormal uptake, 2 (DS2) is uptake less than or equal to mediastinal blood pool, 3 (DS3) is uptake greater than mediastinal blood pool but less than or equal to liver, 4 (DS4) is uptake moderately greater than liver, and 5 (DS5) is uptake markedly greater than liver. [31,32] The Radiologists independently reviewed each session in a blinded fashion.
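The randomized, batched reading schedule can be sketched as follows. The sketch assumes sessions are identified by hypothetical (exam number, reconstruction) pairs, shuffles them with Python's standard `random` module, and gives each batch of 10 a start day 14 days after the previous one; it does not describe the authors' actual randomization procedure.

```python
import random


def schedule_batches(session_ids, batch_size=10, gap_days=14, seed=None):
    """Shuffle sessions and split them into reading batches spaced gap_days apart."""
    rng = random.Random(seed)
    shuffled = list(session_ids)
    rng.shuffle(shuffled)
    batches = []
    for start in range(0, len(shuffled), batch_size):
        batches.append({
            "start_day": (start // batch_size) * gap_days,
            "sessions": shuffled[start:start + batch_size],
        })
    return batches


# 104 exams x 2 reconstructions = 208 sessions (hypothetical IDs).
sessions = [(exam, recon) for exam in range(1, 105) for recon in ("OSEM", "BPL")]
plan = schedule_batches(sessions, seed=2018)
print(len(plan), "batches; last batch starts on day", plan[-1]["start_day"])
```

With 208 sessions this yields 21 batches, the last holding 8 sessions. The sketch does not prevent the OSEM and BPL sessions of one exam from landing in the same batch; adding that constraint would be a design choice not described in the text.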

Statistical analysis
Analysis of the data was performed using SAS (version 9.4, Boston, MA) and BlueSky (version 7.40, Cary, NC). Continuous data are reported as mean and range or standard deviation. Continuous variables were compared between reconstructions using a paired t test. Categorical data are reported as counts and relative frequencies. Interobserver agreement was calculated using Kendall's coefficient of concordance, and intraobserver agreement was calculated using weighted κ. The overall DS among the 4 interpreters is reported as the median. A P value <.05 was considered statistically significant.
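Both agreement statistics can be computed from first principles, as in the sketch below. It implements Kendall's W with the standard correction for tied ranks (ties are frequent with 5-point DS data) and Cohen's weighted κ. The study does not state whether linear or quadratic weights were used, so linear weights are an assumption here, and the function names are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions start..end hold ranks start+1..end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(scores_by_rater):
    """Kendall's coefficient of concordance with tie correction.

    scores_by_rater[r][i] is rater r's score for item (lesion) i.
    """
    m = len(scores_by_rater)
    n = len(scores_by_rater[0])
    rank_sums = [0.0] * n
    tie_term = 0
    for scores in scores_by_rater:
        for i, rank in enumerate(average_ranks(scores)):
            rank_sums[i] += rank
        tie_term += sum(t ** 3 - t for t in Counter(scores).values())
    mean_rank_sum = m * (n + 1) / 2
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    denominator = m ** 2 * (n ** 3 - n) - m * tie_term
    if denominator == 0:
        raise ValueError("W is undefined when every rater gives all items the same score")
    return 12 * s / denominator


def weighted_kappa(a, b, categories=(1, 2, 3, 4, 5), weighting="linear"):
    """Cohen's weighted kappa between two ratings of the same items (e.g. OSEM vs BPL)."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)
    observed = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[index[x]][index[y]] += 1.0 / n
    row_marg = [sum(observed[i]) for i in range(k)]
    col_marg = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d if weighting == "linear" else d ** 2  # anything else: quadratic

    obs = sum(disagreement(i, j) * observed[i][j] for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row_marg[i] * col_marg[j] for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return 1.0 - obs / expected
```

For interobserver agreement, `scores_by_rater` would hold the 4 readers' DSs for the same lesions; for intraobserver agreement, `a` and `b` would be one reader's OSEM and BPL scores for the same lesions.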

Results
The initial search produced 223 unique PET/CT exams, of which 112 were reconstructed with both OSEM and BPL (Q.Clear) methods. Five exams were excluded due to the following: irretrievable image data (2), exam from outside institution (2), large amount of radiotracer extravasation (1). Three exams were used for training purposes. A total of 104 unique exams among 52 patients were available for final analysis. Each exam was reconstructed using both OSEM and BPL (Q.Clear) methods producing 208 total sessions for interpretation.
Automated segmentation using PET Edge (MIM Software Inc., Cleveland, OH) required manual input by the moderator for accurate segmentation in 23% (16/69) of lesions >1 cm and 16% (10/62) of lesions ≤1 cm. Manual correction was often needed for lesions near structures with high physiologic FDG uptake, such as the brain, kidneys, and urinary bladder.
Patient demographics, phase of care, and physiologic PET-related factors are shown in Table 1. Normal liver SUVmean and SUVmax values were slightly lower with BPL than with OSEM (P ≤ .001), whereas the signal-to-noise ratio was slightly higher with BPL. Blood pool SUVmean was slightly lower with BPL than with OSEM (P ≤ .001), with no significant difference in SUVmax.
The differences between OSEM and BPL quantitative PET metrics for FDG-avid lesions are shown in Tables 2 and 3, stratified by lesion diameter (>1 cm and ≤1 cm). For both lesion sizes, SUVmax and SUVmean were significantly (P < .001) greater with BPL than with OSEM. For lesions >1 cm, BPL SUVmax was on average 26% greater than OSEM SUVmax; for lesions ≤1 cm, it was on average 112% greater. For lesions >1 cm, BPL shifted 13% of SUVmax values from below to above the liver reference; for lesions ≤1 cm, it shifted 34%. There was no significant difference between BPL and OSEM in TLG or MTV for lesions >1 cm. For lesions ≤1 cm, MTV was significantly lower, by 52%, with BPL than with OSEM (P < .001).
The lesion-to-liver SUVmax ratios were 36% and 132% higher with BPL compared to OSEM for lesions >1 cm and ≤1 cm, respectively.
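The per-lesion comparisons reported above (percentage SUVmax change, the paired t test, shifts across the liver reference, and lesion-to-liver ratios) can be sketched as below. Assumptions: the "mean % greater" figures are taken as the mean of per-lesion percentage differences; each lesion's liver reference is a value measured on the same reconstruction (the text does not state whether liver SUVmean or SUVmax was used); and only the t statistic is computed, since a P value requires the t distribution (for example via `scipy.stats.ttest_rel`). Function names are hypothetical.

```python
import math
import statistics


def percent_differences(osem, bpl):
    """Per-lesion percent difference of BPL relative to OSEM."""
    return [100.0 * (b - o) / o for o, b in zip(osem, bpl)]


def paired_t_statistic(osem, bpl):
    """Paired t statistic and degrees of freedom for the BPL - OSEM differences."""
    diffs = [b - o for o, b in zip(osem, bpl)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def fraction_crossing_liver(osem, bpl, liver_osem, liver_bpl):
    """Fraction of lesions at or below liver on OSEM but above liver on BPL."""
    crossed = sum(
        1
        for o, b, lo, lb in zip(osem, bpl, liver_osem, liver_bpl)
        if o <= lo and b > lb
    )
    return crossed / len(osem)


def lesion_to_liver(lesion_suvmax, liver_suv):
    """Lesion-to-liver SUV ratio for each lesion."""
    return [lesion / liver for lesion, liver in zip(lesion_suvmax, liver_suv)]
```

Averaging per-lesion percentages weights small and large lesions equally, whereas comparing mean SUVmax values would weight high-uptake lesions more; the text does not specify which was used.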
Changes in visual DS and reader agreement between OSEM and BPL are shown in Table 4. To explore possible clinical implications, the higher DS of the lesions >1 cm and ≤1 cm was used for each exam, and the median DS of the interpreting Radiologists was used as the final summary DS. There was no DS change in 75% (78/104) of cases. However, 23% (24/104) of cases were upgraded to a higher DS with BPL compared with OSEM; of these, 9 (9% of all 104 cases) were shifted into the DS4/DS5 range. In contrast, 1.9% (2/104) were downgraded into the DS1/DS2 range. The interobserver agreement among the 4 readers was very good to excellent for both OSEM (0.81-0.89) and BPL (0.75-0.85) PET reconstructions. The intraobserver agreement between OSEM and BPL was good to very good (0.63-0.73) for lesions >1 cm and fair (0.29-0.44) for lesions ≤1 cm.
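The exam-level summary DS used above can be computed as in this sketch: each reader's exam score is the higher of the two lesion scores, the summary is the median across readers, and the OSEM and BPL summaries are then compared. Treating an exam with no annotated lesion as DS1 is an assumption, as are the function names. With 4 readers the median can fall halfway between two scores (for example 3.5).

```python
import statistics
from collections import Counter


def reader_exam_ds(ds_large, ds_small):
    """Higher DS of the >1 cm and <=1 cm lesions; None means no such lesion was annotated."""
    scores = [s for s in (ds_large, ds_small) if s is not None]
    return max(scores) if scores else 1  # assumption: no annotated lesion -> DS1


def exam_summary_ds(reader_pairs):
    """Median across readers of each reader's exam-level DS."""
    return statistics.median([reader_exam_ds(large, small) for large, small in reader_pairs])


def classify_changes(osem_summary, bpl_summary):
    """Count exams whose summary DS was upgraded, downgraded, or unchanged with BPL."""
    labels = []
    for o, b in zip(osem_summary, bpl_summary):
        if b > o:
            labels.append("upgraded")
        elif b < o:
            labels.append("downgraded")
        else:
            labels.append("unchanged")
    return Counter(labels)
```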

Discussion
Debate surrounds the differences between OSEM and BPL PET reconstruction methods when determining DS. [22,33] This study found significantly higher SUVmax and SUVmean values with BPL than with OSEM. For volumetric PET data (MTV and TLG), the difference between BPL and OSEM was significant only for MTV in lesions ≤1 cm. The visual DS within individual patients changed in 25% of cases when OSEM and BPL were reviewed by the same reader. Interobserver visual DS agreement was good to excellent for both OSEM and BPL. However, intraobserver visual DS agreement between OSEM and BPL reconstructions was good for lesions >1 cm but only fair for lesions ≤1 cm.
Our findings in patients with DLBCL align with prior reports of higher SUV values with BPL than with OSEM reconstruction in a variety of tumors. [14][15][16][17][18][19][20][21] Lymphoma lesions are often evaluated relative to normal liver and blood pool FDG uptake. [31,32] The rise in lesion SUV values with BPL could be controlled by reporting lesion-to-liver SUV ratios, and such methods have been proposed. [34] However, in our study the average lesion-to-liver SUVmax ratio was 5.5 with BPL versus 4.0 with OSEM for lesions >1 cm, and the difference was more pronounced for lesions ≤1 cm, with an average ratio of 4.0 with BPL versus 1.7 with OSEM. Therefore, caution should be used when applying BPL SUVmax values to staging DLBCL, especially for lesions ≤1 cm. That said, volumetric PET data appear relatively robust to differences between BPL and OSEM reconstruction. There were no significant differences in MTV or TLG for lesions >1 cm, and only MTV differed significantly for lesions ≤1 cm. However, volume differences of <1 cc, such as that shown in Figure 1, are of questionable clinical significance. These findings suggest that adopting volumetric PET metrics may improve the consistency of results across PET reconstruction techniques when imaging DLBCL.

Table 3. Quantitative PET metrics for lymphomatous lesions ≤1 cm in diameter (n = 62).
In 2018, Enilorac et al found a minor risk of altered DS or clinical outcome in lymphoma patients when comparing PSF reconstruction with a European Association of Nuclear Medicine Research Ltd (EARL)-harmonized filter. [35] Enilorac et al used quantitative SUV thresholds to obtain the DS, and when DSs were grouped into DS1 to DS3 and DS4 to DS5 categories, the frequency of discordance was 3.2% at end of treatment and 5.0% at interim exams. In 2020, Wyrzykowski et al evaluated the impact of BPL (Q.Clear) compared with OSEM on lymphoma DS using lesion SUVmax thresholds relative to normal liver and blood pool, and found an overall discordance of 15.7%, with 7.1% converting to DS4/DS5 with BPL. [36] Using visual DS, this study found an overall discordance of 25% based on the median DS of the 4 Radiologists. DS was significantly higher with BPL than with OSEM in our cohort, with most upgrades shifting from DS4 to DS5 (7.7%), which may not be clinically significant. However, 6.7% (7/104) shifted from DS3 to DS4 and 1.9% (2/104) shifted from DS1 to DS4 with BPL compared with OSEM, changes that are more likely to impact patient care.
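The threshold-based DS used in the cited studies, and the DS1 to DS3 versus DS4 to DS5 grouping, can be illustrated with a hypothetical sketch. The threshold logic follows the Lugano score definitions, but the factor separating "moderately" from "markedly" greater than liver (2.0 × liver here) is an assumption. It is not the threshold used by Enilorac et al, Wyrzykowski et al, or this study, whose DSs were assigned visually.

```python
def quantitative_ds(lesion_suvmax, blood_pool_suv, liver_suv, marked_factor=2.0, has_uptake=True):
    """Hypothetical SUV-threshold Deauville score (not the visual method of this study)."""
    if not has_uptake:
        return 1  # no abnormal uptake
    if lesion_suvmax <= blood_pool_suv:
        return 2
    if lesion_suvmax <= liver_suv:
        return 3
    if lesion_suvmax <= marked_factor * liver_suv:  # assumed "moderately greater" band
        return 4
    return 5


def dichotomized_discordance(ds_a, ds_b, cutoff=4):
    """Fraction of exams whose DS1-3 versus DS4-5 grouping differs between two readings."""
    discordant = sum(1 for a, b in zip(ds_a, ds_b) if (a >= cutoff) != (b >= cutoff))
    return discordant / len(ds_a)
```

Because BPL raises lesion SUV proportionally more than liver SUV, the same lesion can cross the liver or the assumed "marked" threshold under BPL while staying below it under OSEM, which is the mechanism behind the DS3 to DS4 shifts discussed above.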
Previously reported interobserver agreement for DS on FDG PET/CT has ranged from 0.35 to 0.87, with evidence of improvement when a training session is held prior to data collection. [28][29][30] We report slightly higher agreement for OSEM than for BPL, with Kendall coefficient of concordance values of 0.81 and 0.75, respectively. Our interobserver agreement is at the high end of prior reports, which may be attributable to the training session incorporated into our method.
Intraobserver FDG PET/CT visual DS agreement when interpreted with and without clinical information has been reported at 0.48 and 0.62, respectively. [30] Among our 4 reviewers, the average intraobserver agreement between OSEM and BPL was 0.67 (0.63-0.73) for lesions >1 cm and 0.40 (0.29-0.44) for lesions ≤1 cm (weighted κ). While our results are similar to prior reports for lesions >1 cm, the intraobserver agreement for lesions ≤1 cm is lower. Therefore, greater variation between OSEM and BPL reconstruction results may be expected when interpreting small lymphomatous lesions.
There are weaknesses inherent to this retrospective review. A small to moderate number of patients (52) and exams (104) were available for review. Eighteen percent of exams had no FDG-avid lesions, which may have increased observer agreement; however, this rate mimics clinical practice, where not all exams are positive. Intraobserver agreement between OSEM and BPL exams was subject to recency bias. However, reviewers were blinded and interpreted exams in random order, in batches of 10 sessions separated by at least 2 weeks, to help minimize fatigue and recency bias. The clinical relevance of this review may also be limited by the blinded design and by the inclusion of lesions ≤1 cm, which may be considered clinically inconsequential. However, there is evidence that clinical information has little impact on observer agreement in this context. [30] Further, to ensure that observers interpreted exactly the same lesions, sessions annotating the lesions for DS assignment were saved in the reviewing software, which is subject to leading bias.

Table 4. PET visual Deauville score (DS) observer agreement (FDG PET/CT DS interpretation; observer agreement with 95% CI).

Currently, the mix of PET technologies in practice poses a challenge to clinical care and research. A common solution is to produce two PET reconstructions: one harmonized for quantitative measurement and one optimized for lesion detection. Various organizations have worked towards PET standardization across vendors, including the European Association of Nuclear Medicine, which launched the EARL accreditation program in 2006. The first generation of PET accreditation (EARL1) was initiated in 2010 and updated in 2015. [37,38] Recognizing the need to account for emerging advanced PET techniques such as time-of-flight and PSF reconstruction, EARL2 was developed and validated. [39] EARL accreditation can minimize differences in quantitative PET data, and PET reader agreement can also be improved with PET harmonization. [29,40] Progress in clinical PET harmonization has been made. [41] However, transitioning to more advanced PET techniques remains a persistent challenge.
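One common way to produce a harmonized reconstruction is to add Gaussian post-smoothing to an advanced reconstruction. Under the simplifying assumption that both the native resolution and the filter are Gaussian, their widths add in quadrature, so the additional filter needed is $\mathrm{FWHM}_{\mathrm{filter}} = \sqrt{\mathrm{FWHM}_{\mathrm{target}}^2 - \mathrm{FWHM}_{\mathrm{native}}^2}$. The sketch below computes that width and a normalized 1D kernel; the resolution values and function names are hypothetical, and EARL compliance is actually established with phantom recovery-coefficient measurements rather than this formula.

```python
import math

# FWHM = 2 * sqrt(2 ln 2) * sigma for a Gaussian (about 2.3548).
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def harmonizing_filter_fwhm(native_fwhm_mm, target_fwhm_mm):
    """Extra Gaussian FWHM that brings the native resolution to the target (quadrature sum)."""
    if target_fwhm_mm < native_fwhm_mm:
        raise ValueError("post-smoothing cannot make an image sharper")
    return math.sqrt(target_fwhm_mm ** 2 - native_fwhm_mm ** 2)


def gaussian_kernel_1d(fwhm_mm, voxel_mm, radius_sigmas=3.0):
    """Normalized 1D Gaussian kernel sampled at voxel centers, truncated at radius_sigmas."""
    sigma_vox = fwhm_mm / FWHM_PER_SIGMA / voxel_mm
    if sigma_vox == 0:
        return [1.0]  # no smoothing needed
    half_width = max(1, math.ceil(radius_sigmas * sigma_vox))
    weights = [math.exp(-0.5 * (i / sigma_vox) ** 2) for i in range(-half_width, half_width + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values: 6 mm native resolution, 10 mm target, 4 mm voxels.
filter_fwhm = harmonizing_filter_fwhm(6.0, 10.0)  # 8.0 mm
kernel = gaussian_kernel_1d(filter_fwhm, voxel_mm=4.0)
```

Because a Gaussian is separable, applying the 1D kernel along each image axis in turn gives the corresponding 3D filter.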

Conclusion
Growing evidence points to a relevant DS shift between OSEM and BPL PET reconstruction methods when imaging lymphoma with FDG, whereas volumetric PET data such as MTV and TLG appear less sensitive to the reconstruction technique. While interobserver DS agreement is high for both OSEM and BPL, intraobserver agreement between the two reconstructions is lower for smaller lesions. PET harmonization using EARL accreditation methods is encouraged to optimize patient care and research.

Acknowledgments
The authors acknowledge statistical methods consultation from the Biostatistics, Epidemiology and Research Design resource of Mayo Clinic's Center for Clinical and Translational Science, and would like to thank Sonia Watson, PhD, for assistance in preparation of the manuscript.